Learning in embodied action-perception loops through exploration
Abstract: Although exploratory behaviors are ubiquitous in the animal kingdom, their computational underpinnings are still largely
unknown. Behavioral Psychology has identified learning as a primary drive underlying many exploratory behaviors. Exploration is seen 
as a means for an animal to gather sensory data useful for reducing its ignorance about the environment. While related problems have 
been addressed in Data Mining and Reinforcement Learning, the computational modeling of learning-driven exploration by embodied 
agents is largely unrepresented. 

Here, we propose a computational theory for learning-driven exploration based on the concept of missing information that allows an 
agent to identify informative actions using Bayesian inference. We demonstrate that when embodiment constraints are high, agents 
must actively coordinate their actions to learn efficiently. Compared to earlier approaches, our exploration policy yields more efficient
learning across a range of worlds with diverse structures. The improved learning in turn affords greater success in general tasks 
including navigation and reward gathering. We conclude by discussing how the proposed theory relates to previous information- 
theoretic objectives of behavior, such as predictive information and the free energy principle, and how it might contribute to a general 
theory of exploratory behavior. 

Index Terms: Knowledge acquisition, Information theory, Control theory, Machine learning, Psychology, Computational neuroscience.

1 Introduction

Exploratory behaviors have been observed and
studied in diverse species across the animal kingdom.
As one example, approach and investigation of
novel stimuli have been studied in vertebrates ranging
from fish to birds, reptiles, and mammals [21], [40],
[41], [60], [68], [75]. As another, open field and maze
experimental paradigms for studying locomotive explo- 
ration in mice and rats have recently been adapted to 
behavioral studies in zebrafish [64J , |^5|. Indeed, ex- 
ploratory behaviors have even been described across a 
range of invertebrates fT2], f26l, fSTl, {52\. The prevalence 
of exploratory behaviors across animal species suggests a 
fundamental evolutionary advantage, largely believed to 
derive from the utility of information acquired through 
such behaviors fSS), fSOJ, |^, ^55J, ||6|. 

Computational models of exploratory behavior, de- 
veloped predominantly in the field of Reinforcement 
Learning (RL), have largely focused on the role of explo- 
ration in the acquisition of external rewards ||7|, p2) , 
fTOl, fTTl. An agent that strictly maximizes the 
acquisition of known rewards might fall short in finding 
new, previously unknown, sources of reward. Reward 
maximization therefore requires balancing between di- 
rected harvesting of known rewards (exploitation) and 
the search for new rewards (exploration) |[TJ, |[7j, p2) ,
134|, ||5^, ||7T|. The emphasis in the RL literature on 
reward acquisition, however, stands in contrast to the 
dominant psychological theories of exploration. Quoting 
D. E. Berlyne, a pioneer in the psychology of exploration:

As knowledge accumulated about the condi- 
tions that govern exploratory behavior and 
about how quickly it appears after birth, it 
seemed less and less likely that this behavior 
could be derivative of hunger, thirst, sexual 
appetite, pain, fear of pain, and the like, or 
that stimuli sought through exploration are welcomed
because they have previously accompanied
satisfaction of these drives. [11]

Berlyne further suggested "the most acute motivational 
problems . . . are those in which the perceptual and 
intellectual activities are engaged in for their own sake 
and not simply as aids to handle practical problems" 
[10]. In fact, a consensus has emerged in behavioral
psychology that learning represents the primary drive of 
exploratory behaviors 0, |[37j, jjS^, |j62j. To address this 
gap between computational modeling and behavioral 
psychology, we introduce here a mathematical framework
for studying how behavior affects learning and
develop a novel model of learning-driven exploration. 

In Computational Neuroscience, machine learning
techniques have been successfully applied towards modeling
how the brain might learn the structure underlying
sensory signals, e.g., [15], [16], [18], [36], [47], [54],
[59]. Generally, these methods focus on passive learning
where the learning system cannot directly affect the
sensory input it receives. Exploration, in contrast, is
inherently active, and can only occur in the context of a 
closed action-perception loop. Learning in closed action-perception
loops differs from passive learning in two
important aspects [22]. First, a learning agent's internal
model of the world must keep track of how actions 
change the sensory input. Sensorimotor contingencies, 
such as the way visual scenes change as we shift our 
gaze or move our head, must be taken into account to 
properly attribute changes in sensory signals to their 
causes. This is perhaps reflected in neuroanatomy where 
tight sensory-motor integration has been reported at all 
levels of the brain [23], [24]. Though often taken for
granted, sensorimotor contingencies must actually be
learned during the course of development, as is eloquently
expressed in the explorative behaviors of young
infants (e.g., grasping and manipulating objects during
proprioceptive exploration or bringing them into visual
view during intermodal exploration) [45], [48], [57].

The second crucial aspect of learning in a closed 
action-perception loop is that actions direct the acqui- 
sition of sensory data. To discover what is inside an 
unfamiliar box, a curious child must open it. To learn 
about the world, scientists perform experiments. Direct- 
ing the acquisition of data is particularly important for 
embodied agents whose actuators and sensors are phys- 
ically confined. Since the most informative data may not 
always be accessible to a physical sensor, embodiment 
may constrain an exploring agent and require that it 
coordinates its actions to retrieve useful data. 

In the model we propose here, an agent moving 
between discrete states in a world has to learn how its ac- 
tions influence its state transitions. The underlying tran- 
sition dynamics are governed by a Controllable Markov 
Chain (CMC). Within this simple framework, various 
utility functions for guiding exploratory behaviors will 
be studied, as well as several methods for coordinating 
actions over time. The different exploratory strategies are 
compared in their rate of learning. 



2 Model 

2.1 Mathematical framework for embodied active 
learning 

Controllable Markov chains (CMCs) are a simple extension 
of Markov chains that incorporate a control variable for 
switching between different transition distributions p9) . 
Formally, a CMC is a 3-tuple $(\mathcal{S}, \mathcal{A}, \Theta)$ where:

• $\mathcal{S}$ is a finite set of the possible states of the system
(for example, the possible locations of an agent in its
world). $N = |\mathcal{S}|$

• $\mathcal{A}$ is a finite set of the possible control values, i.e.,
the actions the agent can choose. $M = |\mathcal{A}|$

• $\Theta$ is a 3-dimensional CMC kernel describing the
probability of transitions between states given an
action (for example, the probability an agent moves
from state $s$ to state $s'$ when it chooses action $a$):

$$p_{a,s}(s' \mid \Theta) = \Theta_{a,s,s'}, \qquad \sum_{s'} \Theta_{a,s,s'} = 1 \tag{1}$$

For any fixed action, $\Theta_{a,:,:}$ is a (two-dimensional)
stochastic matrix describing a Markov process. Each
column in this matrix, $\Theta_{a,s,:}$, defines a transition
distribution, which is a categorical (finite discrete) distribution
specifying the likelihoods for the next state.
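As a minimal sketch of the definition above, a CMC kernel can be stored as a nested list indexed as `theta[a][s][s_next]`. The helper names `random_cmc` and `step`, the zero-based state indices, and the use of normalized random weights to fill the kernel are illustrative assumptions rather than part of the model specification.

```python
import random

def random_cmc(n_states, n_actions, rng):
    """Build theta[a][s] = list of next-state probabilities (each sums to 1)."""
    theta = []
    for _ in range(n_actions):
        rows = []
        for _ in range(n_states):
            # Arbitrary positive weights, normalized into a categorical distribution.
            weights = [rng.expovariate(1.0) for _ in range(n_states)]
            total = sum(weights)
            rows.append([w / total for w in weights])
        theta.append(rows)
    return theta

def step(theta, a, s, rng):
    """Sample the next state from the transition distribution theta[a][s]."""
    next_states = range(len(theta[a][s]))
    return rng.choices(next_states, weights=theta[a][s])[0]

rng = random.Random(0)
theta = random_cmc(n_states=5, n_actions=2, rng=rng)

# An embodied agent: each action is taken from the state the last one produced.
s = 0
for _ in range(3):
    a = rng.randrange(2)
    s = step(theta, a, s, rng)
```

The loop at the end illustrates the embodiment restriction: the agent picks the action freely, but the state it acts from is whatever the previous transition produced.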

The CMC provides a simple mathematical framework 
for modeling exploration in embodied action-perception 
loops. At every time step, an exploring agent is allowed 
to freely select any action $a \in \mathcal{A}$. The learning task of the
exploring agent is to build from observed transitions an
estimate, the internal model $\hat{\Theta}$, of the true CMC kernel,
the world $\Theta$. We assume that the explorer begins with
limited information about the world in the form of a 
prior and must improve its estimate by acting in the 
world and gathering observations. The states can be 
directly observed by the agent, i.e. the system is not 
hidden. In the CMC framework, an agent's immediate 
ability to interact with and observe the world is limited 
by the current state. This restriction models the embod- 
iment of the agent. To ameliorate the myopia imparted 
by its embodiment, an agent can coordinate its actions 
over time. Our primary question is how action policies 
can optimize the speed and efficiency of learning in 
embodied action-perception loops. 

2.2 Information-theoretic assessment of learning 

To assess an agent's success towards learning, we determine
the missing information $I_M$ in its internal model
(as proposed by Pfaffelhuber [49]). We do this by first
calculating the Kullback-Leibler (KL) divergence of the
internal model from the world for each transition distribution:

$$D_{KL}(\Theta_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:}) := \sum_{s'=1}^{N} \Theta_{a,s,s'} \log_2\left(\frac{\Theta_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right)$$

The KL-divergence can be interpreted as the expected
amount of information, in bits, lost when observations
(following the true distribution) are communicated using
an encoding scheme optimized for the estimated distribution
[13]. The loss is large when the two distributions
differ greatly and zero when they are identical. The
missing information is then calculated as the sum of the
KL-divergences for each transition distribution:

$$I_M(\Theta \,\|\, \hat{\Theta}) := \sum_{s \in \mathcal{S},\, a \in \mathcal{A}} D_{KL}(\Theta_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:}) \tag{2}$$

We will use missing information (2) to assess learning
and to compare the performance of different explorative 
strategies. Steeper decreases in missing information over 
time represent faster learning and thus more efficient 
explorative strategies. Table 1 has been included as a 
reference guide for the various measures discussed in 
this manuscript for assessing or guiding exploration. 
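The missing information can be sketched in a few lines, assuming kernels stored as nested lists `theta[a][s][s_next]`. The function names `kl_bits` and `missing_information` are illustrative, and the toy world below is invented purely for the example.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits. Terms with p_i = 0 contribute zero.
    Assumes q_i > 0 wherever p_i > 0 (true for smoothed estimates)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def missing_information(theta, theta_hat):
    """Sum of KL divergences of the internal model from the world,
    taken over every (action, state) transition distribution."""
    return sum(
        kl_bits(theta[a][s], theta_hat[a][s])
        for a in range(len(theta))
        for s in range(len(theta[a]))
    )

# Toy world: one action, two states. State 0 is a fair coin,
# state 1 always returns to state 0.
world = [[[0.5, 0.5], [1.0, 0.0]]]
# Internal model that assumes uniform transitions everywhere.
model = [[[0.5, 0.5], [0.5, 0.5]]]

print(missing_information(world, model))  # 1.0 (bit), all from state 1
```

In this toy case the model is already exact for state 0, so the single bit of missing information comes entirely from the deterministic transition it has not yet learned.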
TABLE 1
Table of Information Measures

Name used here, Abbreviation (Equation Number) | Name used in [Reference] | Mathematical expression
Missing Information, $I_M$ (2) | Missing Information [49] | $\sum_{s,a} D_{KL}(\Theta_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:})$
Information Gain, $I_G$ (10) | | $I_M(\hat{\Theta}) - I_M(\hat{\Theta}^{a,s,s^*})$
Predicted Information Gain, PIG (11) | Information Gain [44] |
Posterior Expected Information Gain, PEIG | KL-Divergence [67] or Surprise [29] |
Predicted Mode Change, PMC (19) | Probability Gain [44] |
Predicted L1 Change, PLC (20) | Impact |

2.3 Bayesian inference learning in an agent 

During exploration, a learning agent will gather data in 
the form of experienced state transitions. At every time 
step, it can update its internal model $\hat{\Theta}$ with its last
observation. 

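Assuming the transition probabilities are drawn from
a prior distribution $f$ (a distribution over CMC kernels),
and letting $d$ be the history of observed transitions, the
Bayesian estimate $\hat{\Theta}_{a,s,s'}$ for the probability of
transitioning to state $s'$ from state $s$ under action $a$ is given
by:

$$\begin{aligned}
\hat{\Theta}_{a,s,s'} &:= p_{a,s}(s' \mid d) \\
&= \int_{\Theta} p_{a,s}(s', \Theta \mid d)\, d\Theta \\
&= \int_{\Theta} p_{a,s}(s' \mid \Theta, d)\, f(\Theta \mid d)\, d\Theta \\
&= \int_{\Theta} p_{a,s}(s' \mid \Theta)\, f(\Theta \mid d)\, d\Theta \\
&= \int_{\Theta} \Theta_{a,s,s'}\, f(\Theta \mid d)\, d\Theta \\
&= E_{\Theta}[\Theta_{a,s,s'} \mid d] = E_{\Theta \mid d}[\Theta_{a,s,s'}]
\end{aligned} \tag{3}$$

For discrete priors the above integrals would be replaced
with summations. At the beginning of exploration, when
the history $d$ is empty, $f(\Theta \mid d)$ is simply the prior $f(\Theta)$.
Equation (3) demonstrates that the Bayesian estimate
is equivalent to the expected value of the true CMC
kernel given the data. If we further assume that each
transition distribution $\Theta_{a,s,:}$ is independently drawn
from the marginals $f_{a,s}$ of the distribution $f$ and we
let $d_{a,s}$ be the history of state transitions experienced
when taking action $a$ in state $s$, (3) simplifies to the
independent estimation of each transition distribution:

$$\begin{aligned}
\hat{\Theta}_{a,s,s'} &= \int_{\Theta} \Theta_{a,s,s'}\, f(\Theta \mid d)\, d\Theta \\
&= \int_{\Theta_{a,s,:}} \Theta_{a,s,s'}\, f_{a,s}(\Theta_{a,s,:} \mid d_{a,s})\, d\Theta_{a,s,:} \\
&= E_{\Theta_{a,s,:} \mid d_{a,s}}[\Theta_{a,s,s'}]
\end{aligned} \tag{4}$$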



In the following theorem, we demonstrate that the
Bayesian estimator minimizes the expected missing
information and thus is the best estimate under our
objective function (2).

Theorem 1: Let $\Theta$ and $\Phi$ be CMC kernels. $\Theta$ describes
the ground truth environment generated from a prior
distribution. $\Phi$ is any internal model of the agent. Then
the expected missing information between $\Theta$ and an
internal model $\Phi$, given data $d$, is minimized by the
Bayesian estimate $\hat{\Theta}$:

$$\hat{\Theta} = \arg\min_{\Phi} E_{\Theta \mid d}\left[I_M(\Theta \,\|\, \Phi)\right] \tag{5}$$



Proof: Since missing information is simply the sum
of the KL-divergences for each transition kernel (2),
minimizing missing information is equivalent to
independently minimizing these KL-divergences:

$$\begin{aligned}
&\arg\min_{\Phi_{a,s,:}} E_{\Theta_{a,s,:} \mid d}\left[D_{KL}(\Theta_{a,s,:} \,\|\, \Phi_{a,s,:})\right] \\
&= \arg\min_{\Phi_{a,s,:}} E_{\Theta_{a,s,:} \mid d}\left[\sum_{s'} \Theta_{a,s,s'} \log_2 \frac{\Theta_{a,s,s'}}{\Phi_{a,s,s'}}\right] \\
&= \arg\min_{\Phi_{a,s,:}} E_{\Theta_{a,s,:} \mid d}\left[\sum_{s'} \Theta_{a,s,s'} \log_2 \Theta_{a,s,s'} - \sum_{s'} \Theta_{a,s,s'} \log_2 \Phi_{a,s,s'}\right] \\
&= \arg\min_{\Phi_{a,s,:}} -E_{\Theta_{a,s,:} \mid d}\left[\sum_{s'} \Theta_{a,s,s'} \log_2 \Phi_{a,s,s'}\right] \\
&= \arg\min_{\Phi_{a,s,:}} -\sum_{s'} E_{\Theta_{a,s,:} \mid d}\left[\Theta_{a,s,s'}\right] \log_2 \Phi_{a,s,s'} \\
&= \arg\min_{\Phi_{a,s,:}} H\left[E_{\Theta_{a,s,:} \mid d}[\Theta_{a,s,:}];\, \Phi_{a,s,:}\right]
\end{aligned}$$

Here $H[\varphi; \psi]$ denotes the cross-entropy [13]. Then, by
Gibbs' inequality [13] we conclude:

$$\arg\min_{\Phi_{a,s,:}} H\left[E_{\Theta_{a,s,:} \mid d}[\Theta_{a,s,:}];\, \Phi_{a,s,:}\right] = E_{\Theta_{a,s,:} \mid d}[\Theta_{a,s,:}] = \hat{\Theta}_{a,s,:}(d)$$

□

Fig. 1. Example maze. The 36 states correspond to
rooms in a maze. The 4 actions correspond to noisy
translations in the cardinal directions. Two transition
distributions are depicted, each by a set of 4 arrows emanating
from their starting state. Flat-headed arrows represent
translations into walls, resulting in staying in the same
room. Dashed arrows represent translation into a portal
(blue lines) leading to the base state (blue target). The
shading of an arrow indicates the probability of the
transition (darker color represents higher probability).

The analytical form for the Bayesian estimate will de- 
pend on the prior. In the following section, we introduce 
three classes of CMCs which will be considered in this 
study and specify the Bayesian estimates for each. 

2.4 Three test environments for studying explo- 
ration 

In the course of exploration, the data an agent accumu- 
lates will depend on both its behavioral strategy as well 
as the world structure. Studying diverse environments, 
i.e., CMCs that differ greatly in structure, will help us 
to investigate how world structure affects the relative
performance of different exploratory strategies and to 
identify action policies that produce efficient learning 
under broad conditions. 



The three different classes of test environments to be 
investigated will be called Dense Worlds, Mazes, and 1- 
2-3 Worlds. For each class, random CMCs are generated 
by drawing the transition distributions from specific dis- 
tributions. These generative distributions are also given 
to the agents as priors for performing Bayesian inference. 

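1) Dense Worlds correspond to complete directed probability
graphs with $N = 10$ states and $M = 4$ actions.
Each transition distribution is independently drawn
from a Dirichlet distribution:

$$\Theta_{a,s,:} \sim \mathrm{Dir}(\alpha) := \frac{1}{B(\alpha)} \prod_{s'} \Theta_{a,s,s'}^{\alpha - 1} \tag{6}$$

$$B(\alpha) := \frac{\Gamma(\alpha)^N}{\Gamma(N\alpha)}, \qquad \Gamma(\alpha) := \int_0^\infty t^{\alpha - 1} e^{-t}\, dt$$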



The Dirichlet distribution is the conjugate prior of the 
categorical distribution Q and thus a natural distri- 
bution for generating CMCs. It is parametrized by a 
concentration factor a that determines how much weight 
in the Dirichlet distribution is centered at the midpoint 
of the simplex, the space of all possible transition dis- 
tributions. The midpoint corresponds to the uniform 
categorical distribution. For Dense Worlds, we use a con- 
centration parameter a = 1 which results in a uniform 
distribution over the simplex. An example is depicted 
graphically in Fig. S1.

The Bayesian estimate for Dense Worlds has the following
closed-form expression:

$$\hat{\Theta}_{a,s,s'} = \frac{1 + \sum_{k=1}^{K} \delta_{s',\, d_{a,s}[k]}}{K + N} \tag{7}$$

where $K$ is the length of the history $d_{a,s}$, $d_{a,s}[k]$ is the
$k$-th outcome recorded in that history, and $\delta_{x,y}$ is the
Kronecker delta (i.e., $\delta_{x,y}$ is 1 if $x = y$ and 0 otherwise).
Equation (7) reveals that the Bayesian estimate is simply
the relative frequencies of the observed data with
the addition of one fictitious count per transition. The
incorporation of this fictitious observation is referred to
as Laplace smoothing and is often performed to avoid
over-fitting [39]. The derivation of Laplace smoothing
from Bayesian inference over a Dirichlet prior is a
well-known result [38].
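The smoothed-histogram form of this estimate is easy to sketch. The function name `dirichlet_estimate` and the count-vector input are illustrative assumptions; with `alpha=1.0` the function adds one fictitious count per outcome, which is the Laplace smoothing described above.

```python
def dirichlet_estimate(counts, alpha=1.0):
    """Posterior-mean estimate of a categorical distribution under a
    symmetric Dirichlet prior with concentration `alpha`.

    counts[j] is the number of observed transitions to outcome j.
    """
    k = sum(counts)   # length of the history for this (action, state)
    n = len(counts)   # number of possible outcomes
    return [(c + alpha) / (k + alpha * n) for c in counts]

# Four possible next states, four observations: three to state 0, one to state 1.
print(dirichlet_estimate([3, 1, 0, 0]))  # [0.5, 0.25, 0.125, 0.125]
```

Unobserved outcomes keep a non-zero probability, which is what keeps the KL divergence from the true distribution finite.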

2) Mazes consist of $N = 36$ states corresponding
to rooms in a randomly generated 6 by 6 maze and
$M = 4$ actions corresponding to noisy translations, each
biased towards one of the four cardinal directions "up",
"down", "left" and "right". An example is depicted in
Fig. 1. Walking into a wall causes the agent to remain in
its current location. There are 30 transporters randomly
distributed amongst the walls which lead to a base
state. Each maze has a single, randomly chosen base
state (concentric rings in Fig. 1). All transitions that
do not correspond to a single translation are assigned
a probability of zero. The non-zero probabilities are
drawn from a Dirichlet distribution with concentration
parameter $\alpha = 0.25$. The highest probability is assigned
to the state corresponding to the cardinal direction of
the action. The small concentration parameter distributes
more probability weight in the corners of the simplex
corresponding to deterministic transitions. As a result,
Maze transitions have strong biases towards an action's
associated cardinal direction.

Agents in Mazes must estimate the non-zero transitions
using the Dirichlet prior without knowledge of
each action's assigned cardinal direction. Similar to (7),
the Bayesian estimate for maze transitions is given by:

$$\hat{\Theta}_{a,s,s'} = \frac{0.25 + \sum_{k=1}^{K} \delta_{s',\, d_{a,s}[k]}}{K + 0.25 \cdot N_{a,s}} \tag{8}$$

where $d_{a,s}[k]$ is the $k$-th outcome in the history $d_{a,s}$ of
length $K$, and $N_{a,s}$ is the number of non-zero probability states
in the transition distribution $\Theta_{a,s,:}$. As with Dense
Worlds, the Bayesian estimate (8) for Mazes is a Laplace
smoothed histogram.

3) 1-2-3 Worlds consist of $N = 20$ states and $M = 3$
actions. In a given state, action $a = 1$ moves the agent
deterministically to a single target state, action $a = 2$
brings the agent with probability 0.5 to one of two
possible target states, and action $a = 3$ brings the
agent with probability 0.333 to one of 3 potential target
states. The target states are randomly and independently
selected for each action taken in each state. To create an
absorbing state, the probability that state 1 is among the
targets of action $a$ is set to $1 - 0.75^a$. The probability
for all other states to be selected as targets is uniform.
Explicitly, letting $\Omega_a$ be the set of all admissible transition
distributions for action $a$:

$$\Omega_a := \left\{\theta \in \mathbb{R}^N \;\middle|\; \sum_{s'} \theta_{s'} = 1 \text{ and } \theta_{s'} \in \{0, \tfrac{1}{a}\}\ \forall s'\right\}$$

the transition distributions are drawn from the following
distribution:

$$p(\Theta_{a,s,:}) = \begin{cases}
0 & \text{if } \Theta_{a,s,:} \notin \Omega_a \\[4pt]
\dfrac{1 - 0.75^a}{\binom{N-1}{a-1}} & \text{else if } \Theta_{a,s,1} = \tfrac{1}{a} \\[4pt]
\dfrac{1 - (1 - 0.75^a)}{\binom{N-1}{a}} & \text{otherwise}
\end{cases} \tag{9}$$



If this process results in a non-ergodic CMC, it is
discarded and a new CMC is generated. A CMC is
ergodic if, for every ordered pair of states, there exists
an action policy under which an agent starting at the
first state will eventually reach the second. An example
1-2-3 World is depicted in Fig. S2.
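The generative procedure for a single transition distribution in a 1-2-3 World can be sketched as follows. This is an illustration under stated assumptions: states are zero-indexed, so index 0 plays the role of the paper's state 1, the function names are hypothetical, and the ergodicity check with rejection described above is omitted.

```python
import random

def sample_targets(n_states, a, rng):
    """Pick the `a` target states for action a in {1, 2, 3}.
    State 0 (the paper's state 1) is included with probability 1 - 0.75**a;
    the remaining targets are drawn uniformly from the other states."""
    others = list(range(1, n_states))
    if rng.random() < 1 - 0.75 ** a:
        return [0] + rng.sample(others, a - 1)
    return rng.sample(others, a)

def transition_distribution(n_states, a, rng):
    """Each of the `a` targets receives probability 1/a, all others zero."""
    dist = [0.0] * n_states
    for target in sample_targets(n_states, a, rng):
        dist[target] = 1.0 / a
    return dist

rng = random.Random(1)
# One transition distribution per action, for a single state of a 20-state world.
dists = {a: transition_distribution(20, a, rng) for a in (1, 2, 3)}
```

A full world would repeat this for every state and action and then resample whenever the resulting CMC fails the ergodicity condition.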

Bayesian inference in 1-2-3 Worlds differs greatly from
Mazes and Dense Worlds because of its discrete prior.
Essentially, state transitions that have been observed are
accurately estimated, while the remaining probability
weight is distributed across those states that have not yet
been experienced (preferentially to state 1 and uniformly
across other states). Explicitly, if $a, s \to s'$ has been
previously observed, then the Bayesian estimate for $\hat{\Theta}_{a,s,s'}$
is given by:

$$\hat{\Theta}_{a,s,s'} = \frac{1}{a}$$

If $a, s \to s'$ has not been observed but $a, s \to 1$ has, then
the Bayesian estimate is given by:

$$\hat{\Theta}_{a,s,s'} = \left(1 - \frac{T}{a}\right) \frac{1}{N - T}$$

Here $T$ is the number of target states that have already
been observed:

$$T := \left|\{s^* \in d_{a,s}\}\right|$$

Finally, if neither $a, s \to s'$ nor $a, s \to 1$ have been
observed, then the Bayesian estimate is:

$$\hat{\Theta}_{a,s,s'} = \begin{cases}
\dfrac{1}{a} \cdot \dfrac{1 - 0.75^a}{1 + \left(\frac{a}{a - T} - 1\right) \cdot 0.75^a} & \text{if } s' = 1 \\[6pt]
\dfrac{1 - \left(\frac{T}{a} + \hat{\Theta}_{a,s,1}\right)}{N - T - 1} & \text{otherwise}
\end{cases}$$
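The three cases of this estimate can be collected into one function. This is a sketch under assumptions: states are zero-indexed so that index 0 stands for the paper's state 1, the function name `estimate_123` is hypothetical, and the formula for the unobserved absorbing state follows from applying Bayes' rule to the prior (9), with the factor $1/a$ converting "state 1 is a target" into a transition probability.

```python
def estimate_123(n_states, a, observed):
    """Bayesian estimate of one transition distribution in a 1-2-3 World.

    a        -- action index in {1, 2, 3}; each target has probability 1/a
    observed -- set of distinct target states seen so far for this (a, s)
    State 0 plays the role of the paper's absorbing state 1.
    """
    t = len(observed)
    est = [0.0] * n_states
    for s in observed:
        est[s] = 1.0 / a                      # observed targets are known exactly
    if t == a:
        return est                            # all targets found, nothing left over
    unseen = [s for s in range(n_states) if s not in observed]
    if 0 in observed:
        for s in unseen:                      # spread the remainder uniformly
            est[s] = (1 - t / a) / (n_states - t)
    else:
        q = 0.75 ** a
        est[0] = (1.0 / a) * (1 - q) / (1 + (a / (a - t) - 1) * q)
        for s in unseen:
            if s != 0:
                est[s] = (1 - (t / a + est[0])) / (n_states - t - 1)
    return est

prior_guess = estimate_123(20, 1, set())
print(prior_guess[0])  # 0.25: with no data, action 1 leads to state 0 w.p. 1 - 0.75
```

In every branch the returned probabilities sum to one, which is a quick consistency check on the case analysis.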



3 Results 

3.1 Assessing the information-theoretic value of 
planned actions 

The central question to be addressed is how actions
affect the learning process in embodied action-perception
loops. Ideally, actions should be chosen so that the
missing information (2) decreases as fast as possible. As
discussed in Section 2.3, the Bayesian estimate minimizes
the expected missing information. We will assume that
an agent continually updates its internal model accordingly
from the observations it receives. The Bayesian
estimate, however, does not indicate which action will
optimize the utility of future data. Towards this objective,
an agent should try to predict the impact a new
observation will have on its missing information. We
call the decrease in missing information between two
internal models the information gain ($I_G$). Letting $\hat{\Theta}$ be
a current model derived from data $d$ and $\hat{\Theta}^{a,s,s^*}$ be an
updated model derived from adding an observation of
$a, s \to s^*$ to $d$, the information gain for this observation
is:

$$\begin{aligned}
I_G(a, s, s^*) &:= I_M(\Theta \,\|\, \hat{\Theta}) - I_M(\Theta \,\|\, \hat{\Theta}^{a,s,s^*}) \\
&= D_{KL}(\Theta_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:}) - D_{KL}(\Theta_{a,s,:} \,\|\, \hat{\Theta}^{a,s,s^*}_{a,s,:}) \\
&= \sum_{s'} \Theta_{a,s,s'} \log_2 \left(\frac{\hat{\Theta}^{a,s,s^*}_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right)
\end{aligned} \tag{10}$$



Calculating the information gained from taking action $a$
in state $s$ would therefore require knowing $\Theta$ as well as
$s^*$. An agent can only infer the former and can only know
the latter after it has executed the action. In the following
theorem, however, we derive a closed-form expression
for the expected information gain, which we shall call
the predicted information gain (PIG).

[Figure 2: three panels (Dense Worlds, Mazes, 1-2-3 Worlds); horizontal axes: Realized Information Gain (bits)]

Fig. 2. Accuracy of predicted information gain. The average predicted information gain is plotted against the average
realized information gain. Averages are taken over 200 CMCs, N x M transition distributions, and 50 trials. Error
bars depict standard deviations (only plotted above the mean for 1-2-3 Worlds). The arrow indicates the direction of
increasing numbers of observations (top-right = none, bottom-left = 19). The unity lines are drawn in gray.

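Theorem 2: Let $\Theta$ be a CMC kernel whose transition
distributions are independently generated from prior
distributions. If an agent is in state $s$ and has previously
collected data $d$, then the expected information gain for
taking action $a$ and observing the resultant state $s^*$ is
given by:

$$\mathrm{PIG}(a, s) := E_{s^*, \Theta \mid d}\left[I_G(a, s, s^*)\right] = \sum_{s^*} \hat{\Theta}_{a,s,s^*}\, D_{KL}(\hat{\Theta}^{a,s,s^*}_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:}) \tag{11}$$

where $\hat{\Theta}$ is the current internal model of the agent and
$\hat{\Theta}^{a,s,s^*}$ is what the internal model would become if it
were updated with an observation $s^*$ resulting from a
prospective new action $a$.

Proof:

$$\begin{aligned}
E_{s^*, \Theta \mid d}\left[I_G(a, s, s^*)\right]
&= E_{s^*, \Theta \mid d}\left[\sum_{s'} \Theta_{a,s,s'} \log_2 \frac{\hat{\Theta}^{a,s,s^*}_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right] \\
&= E_{s^*, \Theta_{a,s,:} \mid d_{a,s}}\left[\sum_{s'} \Theta_{a,s,s'} \log_2 \frac{\hat{\Theta}^{a,s,s^*}_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right] \\
&= E_{s^* \mid d_{a,s}}\left[E_{\Theta_{a,s,:} \mid d_{a,s}, s^*}\left[\sum_{s'} \Theta_{a,s,s'} \log_2 \frac{\hat{\Theta}^{a,s,s^*}_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right]\right] \\
&= E_{s^* \mid d_{a,s}}\left[\sum_{s'} E_{\Theta_{a,s,:} \mid d_{a,s}, s^*}\left[\Theta_{a,s,s'}\right] \log_2 \frac{\hat{\Theta}^{a,s,s^*}_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right] \\
&= E_{s^* \mid d_{a,s}}\left[\sum_{s'} \hat{\Theta}^{a,s,s^*}_{a,s,s'} \log_2 \frac{\hat{\Theta}^{a,s,s^*}_{a,s,s'}}{\hat{\Theta}_{a,s,s'}}\right] \\
&= E_{s^* \mid d_{a,s}}\left[D_{KL}(\hat{\Theta}^{a,s,s^*}_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:})\right] \\
&= \sum_{s^*} p(s^* \mid d_{a,s})\, D_{KL}(\hat{\Theta}^{a,s,s^*}_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:}) \\
&= \sum_{s^*} \hat{\Theta}_{a,s,s^*}\, D_{KL}(\hat{\Theta}^{a,s,s^*}_{a,s,:} \,\|\, \hat{\Theta}_{a,s,:})
\end{aligned}$$

□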

Notice, (11) can be computed from previously collected
data alone. For each class of environments, Fig. 2
compares the average PIG with the average realized
information gain as successive observations are drawn
from a transition distribution and used to update a
Bayesian estimate. In accordance with Theorem 2, in all
three environments PIG accurately predicts the average
information gain. Thus, theoretically and empirically,
PIG represents an accurate estimate of the average gains
towards the learning objective function that an agent
can expect to receive for taking a planned action in a
particular state.
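For a Dirichlet prior, PIG for one state-action pair can be sketched directly from observation counts. The sketch assumes the expected-KL form of Theorem 2, with the divergence taken from the current estimate to the hypothetically updated one, and a symmetric Dirichlet prior. The function names are illustrative.

```python
import math

def dirichlet_estimate(counts, alpha=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet prior."""
    k, n = sum(counts), len(counts)
    return [(c + alpha) / (k + alpha * n) for c in counts]

def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def predicted_information_gain(counts, alpha=1.0):
    """Expected KL divergence of the current estimate from the estimate
    obtained after one more (hypothetical) observation, weighting each
    possible outcome by its currently estimated probability."""
    current = dirichlet_estimate(counts, alpha)
    pig = 0.0
    for s_star, p_star in enumerate(current):
        hypothetical = list(counts)
        hypothetical[s_star] += 1             # pretend we observed s_star
        updated = dirichlet_estimate(hypothetical, alpha)
        pig += p_star * kl_bits(updated, current)
    return pig

fresh = predicted_information_gain([0, 0, 0])   # nothing observed yet
worn = predicted_information_gain([5, 5, 5])    # fifteen observations
```

Only the counts and the prior enter the computation, which is why an agent can evaluate PIG without access to the true kernel. In this example `fresh` exceeds `worn`, in line with the decrease of PIG over successive observations.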

Interestingly, the equation for computing PIG, the right-hand side
of (11), has been previously considered in the field of
psychology, where it was applied to describe human 
behavior during hypothesis testing |35l, fS^, f46|. To 
our knowledge, however, its equality to the expected 
decrease in missing information (Theorem 2) has not 
been previously shown. 

3.2 Control learners: unembodied and random ac- 
tion 

During exploration, an embodied agent can choose its 
action but is bound to the state that resulted from its 
last transition. A simple exploratory strategy would be 
to always select actions uniformly randomly. We will use 
such a random action strategy as a baseline control for 
learning performance representing a naive explorer. 

In contrast to embodied agents, one can also consider 
an unembodied agent that is allowed to arbitrarily re- 
locate to a new state before taking an action. For an 
unembodied agent, optimization of learning becomes 
much simpler as it decomposes into an independent 
sampling problem [49]. Since the PIG for each transition
distribution decreases monotonically over successive observations
(Fig. 2), learning by an unembodied agent can
be optimized by always sampling from the state and 
action pair with the highest PIG. Thus, learning can be 
optimized in a greedy fashion:

$$(a, s)_{\text{unemb.}} := \arg\max_{(a, s)} \mathrm{PIG}(a, s) \tag{12}$$

[Figure 3: three panels (Dense Worlds, Mazes, 1-2-3 Worlds); horizontal axes: Time (steps)]

Fig. 3. Learning curves for control strategies. The average missing information is plotted over exploration time for the
unembodied positive control and random action baseline control. Standard errors are plotted as dotted lines above
and below learning curves. (n=200)



The learning curves of the unembodied agent will serve 
here as a positive control as it represents an upper bound 
for the performance of embodied agents. 

An initial comparison between random action and 
the unembodied control highlights a notable difference 
among the three classes of environments (Fig. 3). Specifically,
the performance margin between the two controls
is significant in Mazes and 1-2-3 Worlds (p < 0.001),
but not in Dense Worlds (p > 0.01). The significance
was assessed by post-hoc analysis of Friedman's test
[25] comparing the areas under the two learning curves.
Despite using a naive strategy, the random actor is 
essentially reaching maximum performance in Dense 
Worlds, suggesting that exploration of this environment 
is fairly easy. The difference in performance between 
random action and the unembodied control offers an ini- 
tial insight into the constraints experienced by embodied 
agents. A directed exploration strategy may help bridge 
this gap. 

3.3 Exploration strategies based on PIG 

Given that PIG can be computed by an agent using only 
the data it has already collected (along with its prior), we 
wondered whether it could be used as a utility function 
to guide exploration. Since greedy maximization of PIG 
is optimal for the unembodied agent, one might con- 
sider a similar greedy strategy for an embodied agent 
(PIG (greedy)). The key difference would be that the 
embodied agent can only select its action but not its 
current state: 



$$a_{\text{PIG(greedy)}} := \arg\max_{a} \mathrm{PIG}(a, s) \tag{13}$$
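The difference between the two greedy rules is only the set over which the maximum is taken, as this sketch shows. It assumes a precomputed table `pig[a][s]` of PIG values. The function names and the numbers in the table are made up for illustration.

```python
def greedy_unembodied(pig):
    """Unembodied control: choose the (action, state) pair with the highest PIG."""
    pairs = [(a, s) for a in range(len(pig)) for s in range(len(pig[a]))]
    return max(pairs, key=lambda pair: pig[pair[0]][pair[1]])

def greedy_embodied(pig, s):
    """Embodied greedy agent: the state s is fixed, only the action is chosen."""
    return max(range(len(pig)), key=lambda a: pig[a][s])

# Hypothetical PIG table for 2 actions (rows) and 2 states (columns).
pig = [[0.1, 0.9],
       [0.3, 0.2]]

print(greedy_unembodied(pig))   # (0, 1): free to jump to the most informative pair
print(greedy_embodied(pig, 0))  # 1: stuck in state 0, best available action
```

In this toy table the embodied agent in state 0 can gain at most 0.3 bits, while the unembodied agent reaches the 0.9-bit pair directly.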



The performance comparison between PIG(greedy) (13)
and the unembodied control (12) is of particular interest
because the two strategies differ only in that one
is embodied but the other is not. Thus differences in
their performance reflect the embodiment constraint on
learning. As shown in Fig. 4, the performance difference
is largest in Mazes, moderate though significant
in 1-2-3 Worlds and smallest in Dense Worlds (p < 0.001
for Mazes and 1-2-3 Worlds, p > 0.001 for Dense Worlds).
To quantify the embodiment constraint faced in a world, 
we define an embodiment index as the relative differ- 
ence between the areas under the learning curves for 
PIG (greedy) and the unembodied control which average 
0.02 for Dense Worlds, 2.59 for Mazes, and 1.27 for 1-2-3 
Worlds. 

Also of particular interest, the comparison between 
PIG(greedy) and random action provides further insight
differentiating the three classes of worlds (Fig. 4).
Whereas PIG(greedy) yielded no improvement over random
action in Dense Worlds and Mazes (p > 0.001),
it significantly improved learning in 1-2-3 Worlds (p <
0.001), demonstrating that agents benefitted from the
information-theoretic utility function only in 1-2-3 
Worlds. 

Greedy maximization of PIG considers only the imme- 
diate gains available and fails to account for the effect an 
action can have on future utility. In particular, when the 
potential for information gain is unevenly distributed, 
it may be necessary to coordinate actions over time 
to obtain remote but informative observations. Forward 
estimation of total future PIG over multiple time steps 
is intractable as the number of action sequences and 
state outcomes increases exponentially with time. To 
guide an agent towards maximizing long-term gains of 
PIG, we instead employ a back-propagation approach 
previously developed in the field of economics, Value-Iteration
(VI) [8]. The estimation starts at a distant time
point (initialized as $\tau = 0$) in the future with initial
values equal to the PIG for each state-action pair:

$$Q_0(a, s) := \mathrm{PIG}(a, s) \tag{14}$$

Then propagating backwards in time, we maintain a
running total of estimated future value by:

$$V_\tau(s) := \max_a Q_\tau(a, s)$$

$$Q_{\tau - 1}(a, s) := \mathrm{PIG}(a, s) + \gamma \sum_{s'} \hat{\Theta}_{a,s,s'} \cdot V_\tau(s') \tag{15}$$

Here, $0 \le \gamma \le 1$ is a discount factor reducing the
value of gains obtained further in the future. When
$\gamma < 1$, backward propagation can be continued until
convergence. Alternatively, it can simply be executed for
a predefined number of steps. Choosing the latter with
$\gamma = 1$, we construct a behavioral policy (PIG(VI)) for
an agent that coordinates its actions under VI towards
maximizing PIG:

$$a_{\text{PIG(VI)}} := \arg\max_a Q_{-10}(a, s) \tag{16}$$

Fig. 4. Coordinating exploration using predicted information gain. The average missing information is plotted over
exploration time for greedy and value-iterated (VI) maximization of PIG. The standard control strategies and the VI+
positive control are also depicted. Standard errors are plotted as dotted lines above and below learning curves. (n=200)
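The backward recursion can be sketched as a short finite-horizon loop. This illustration assumes a fixed PIG table `pig[a][s]` and an internal model `theta_hat[a][s][s_next]`. The function names, the two-state toy world and its PIG values are invented for the example, and in an actual agent both tables would be recomputed after every observation.

```python
def pig_value_iteration(pig, theta_hat, steps=10, gamma=1.0):
    """Propagate PIG backwards for `steps` steps and return the Q-table."""
    n_actions, n_states = len(pig), len(pig[0])
    q = [row[:] for row in pig]                       # Q_0 := PIG
    for _ in range(steps):
        v = [max(q[a][s] for a in range(n_actions)) for s in range(n_states)]
        q = [[pig[a][s] + gamma * sum(theta_hat[a][s][s2] * v[s2]
                                      for s2 in range(n_states))
              for s in range(n_states)]
             for a in range(n_actions)]
    return q

def choose_action(q, s):
    """Pick the action with the highest propagated value in state s."""
    return max(range(len(q)), key=lambda a: q[a][s])

# Toy world: in state 0, action 0 stays put (small gain) while action 1
# moves to state 1 (no immediate gain). State 1 is absorbing and informative.
theta_hat = [[[1.0, 0.0], [0.0, 1.0]],    # action 0
             [[0.0, 1.0], [0.0, 1.0]]]    # action 1
pig = [[0.1, 1.0],                        # action 0 in states 0 and 1
       [0.0, 0.0]]                        # action 1 in states 0 and 1

q = pig_value_iteration(pig, theta_hat)
greedy_choice = max(range(2), key=lambda a: pig[a][0])   # 0: takes the small gain
planned_choice = choose_action(q, 0)                     # 1: heads for state 1
```

The toy case shows the point of coordination: the greedy rule keeps collecting the small local gain, whereas the value-iterated policy accepts a step with no immediate gain in order to reach the more informative state.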



Comparing the learning curves in Fig. 4 for PIG(VI)
and PIG(greedy) in the three classes of worlds we find 
that coordination of actions yielded the greatest learning 
gains in Mazes, with moderate gains also seen in 1-2- 
3 Worlds. In Dense Worlds PIG(VI), like PIG(greedy) 
and random action, essentially reached maximal learning 
performance. Along with the results for the embodiment 
index above, these results support the hypothesis that 
worlds with high embodiment constraint require agents 
to coordinate their actions over several time steps to 
achieve efficient exploration. 

Convergence and optimality of the VI algorithm can 
be guaranteed [8], but only if the utility function is
stationary and the true world structure is known. To
assess the impairment resulting from the use of the
internal model in VI (15), we constructed a second
positive control, PIG(VI+), which is given the true CMC
kernel $\Theta$ for use during coordinated maximization of
PIG under VI. Under this strategy, $\Theta$ is used only to
coordinate the selection of actions and is not incorporated
into the Bayesian estimate or the PIG utility function. 
Comparing the PIG(VI) agent to the PIG(VI+) control, we 
find that they only differ in Mazes, and this difference 



is relatively small compared to the gains made over 
random or greedy behaviors (Fig. |4|. Altogether these 
results suggest that PIG(VI) may be an effective strategy 
employable by embodied agents for coordinating explo- 
rative actions towards learning. 

From the results so far the picture emerges that the 
three classes of environments offer very different chal- 
lenges for the exploring agent. Dense Worlds are easy to 
explore. Mazes require policies that coordinate actions 
over time but exhibit little sensitivity to the particu- 
lar choice in utility function. 1-2-3 Worlds also require 
coordination of actions over time, though to a lesser 
extent than Mazes. Unlike in Mazes, however, agents 
in 1-2-3 Worlds strongly benefit from the information- 
theoretically derived utility function PIG. 

3.4 Structural features of the three worlds 

We next asked how structural differences in the three 
classes of environments correlated with the above dif- 
ferences in exploration performance. In particular we 
considered two structural features of the worlds, their 
tendency to draw agents into a biased distribution over 
states and how tightly an action controls the future states 
of the agent. 

State bias: To assess how strongly a world biases the
state distribution of agents we consider the equilibrium
distribution under an undirected action policy, random
action. The equilibrium distribution $\Psi$ is the limit
distribution over states after many time steps. To quantify
the bias of this distribution, we compute a structure index
(SI) as the relative difference between its entropy $H(\Psi)$
and the entropy of the uniform distribution $H(U)$:

$$\mathrm{SI}(\Psi) := \frac{H(U) - H(\Psi)}{H(U)}$$

where:

$$H(p(s)) := -\sum_{s \in S} p(s) \log_2 p(s)$$
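As a rough illustration of how such an index might be computed, the sketch below approximates the equilibrium distribution under random action by repeatedly propagating a state distribution, then normalizes the entropy deficit by the entropy of the uniform distribution. The function names, the nested-list layout `theta[a][s][s2]`, the uniform starting distribution, and the fixed iteration count are assumptions of the example. For periodic or non-ergodic worlds the propagated distribution need not converge to a unique limit.

```python
import math


def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability states."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def structure_index(theta, n_iter=1000):
    """Approximate structure index of a world under uniformly random actions.

    theta[a][s][s2] is the true probability of s -> s2 under action a.
    """
    n_actions, n_states = len(theta), len(theta[0])
    # Transition matrix of the chain induced by uniformly random actions.
    p_rand = [
        [sum(theta[a][s][s2] for a in range(n_actions)) / n_actions
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    # Propagate a distribution forward, starting from uniform (an assumption).
    psi = [1.0 / n_states] * n_states
    for _ in range(n_iter):
        psi = [sum(psi[s] * p_rand[s][s2] for s in range(n_states))
               for s2 in range(n_states)]
    h_uniform = math.log2(n_states)
    return (h_uniform - entropy_bits(psi)) / h_uniform
```

Under this sketch, a world whose random-action dynamics leave the state distribution uniform scores 0, and a world that funnels every trajectory into a single state scores 1.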



[Figure 5. Panel (a): x-axis, Structure Index. Panel (b): x-axis, Time (steps). Legend: Dense Worlds, Mazes, 1-2-3 Worlds.]

Fig. 5. Quantifying the structure of the worlds. (a) The
embodiment index, defined in Section 3.3, is plotted
against the structure index for each of 200 Dense Worlds,
Mazes, and 1-2-3 Worlds. (b) The average controllability,
as measured by the mutual information between an action
and a future state, is plotted as a function of the number
of time steps the state lies in the future (n=200). The error
bars depict standard deviations.



The structure index values for 200 worlds in each class of 
environment are plotted against the embodiment index 
(defined in Section 3.3) in Fig. 5a. As depicted, the
embodiment index correlates strongly with the structure 
index. Thus, the state bias seems to represent a signifi- 
cant challenge embodied agents face during exploration. 

Controllability: To measure the capacity for an agent
to control its state trajectory we computed the mutual
information between a random action and a future state:

$$\mathrm{MI}[A_0, S_t \mid s_0] = \sum_{a_0} \sum_{s_t} p(a_0, s_t \mid s_0) \log_2 \frac{p(s_t \mid a_0, s_0)}{p(s_t \mid s_0)}$$
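A minimal sketch of this controllability measure is given below. It assumes that the first action is drawn uniformly at random and that all later actions are also uniformly random. That reading of "a random action", the function name `controllability`, and the layout `theta[a][s][s2]` are assumptions of the example, not details confirmed by the text.

```python
import math


def controllability(theta, s0, t):
    """Mutual information (bits) between a uniformly random first action
    and the state t >= 1 steps later, starting from state s0."""
    n_actions, n_states = len(theta), len(theta[0])
    # Dynamics under uniformly random actions, used after the first step.
    p_rand = [
        [sum(theta[a][s][s2] for a in range(n_actions)) / n_actions
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    # p(s_t | a_0, s_0): one step under a_0, then t - 1 random-action steps.
    cond = []
    for a in range(n_actions):
        dist = list(theta[a][s0])
        for _ in range(t - 1):
            dist = [sum(dist[s] * p_rand[s][s2] for s in range(n_states))
                    for s2 in range(n_states)]
        cond.append(dist)
    # p(s_t | s_0): marginalise over the uniformly chosen first action.
    marg = [sum(cond[a][s2] for a in range(n_actions)) / n_actions
            for s2 in range(n_states)]
    mi = 0.0
    for a in range(n_actions):
        for s2 in range(n_states):
            if cond[a][s2] > 0:
                joint = cond[a][s2] / n_actions  # p(a_0, s_t | s_0)
                mi += joint * math.log2(cond[a][s2] / marg[s2])
    return mi
```

In a toy world where each of two actions deterministically selects the next state, the first action carries one bit about the next state and nothing about the state after that, because the intervening random action erases its influence.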



As shown in Fig. 5b, an action in a Maze or 1-2-3 World
has significantly more impact on future states than an
action in Dense Worlds. Controllability is required for
effective coordination of actions, such as under PIG(VI).
In Mazes, where actions can significantly affect states
far into the future, agents yielded the largest gains
from coordinated actions. However, controllability, while 
necessary, is not sufficient for coordinated actions to have 
the potential of improving learning. For example, a non- 
ergodic world might have high controllability but not 
allow an embodied agent to ever reach a large isolated 
set of states, regardless of whether it coordinated its 
actions or not. In such a world, an unembodied agent 
could reach the isolated states and thereby gain a learn- 
ing opportunity inaccessible to any embodied agent. 

3.5 Comparison to previous explorative strategies 

While exploration in the RL literature has largely focused 
on its role in reward acquisition, many of the principles 
developed to induce exploration can be implemented in 
our framework. In this section, we compare these various 
methods to PIG(VI) under our learning objective. 

Random action is perhaps the most common exploration
strategy used in RL. As we have already seen
in Fig. 4, random action is only efficient for exploring
Dense Worlds. In addition to undirected random action,
the following directed exploration strategies have been
developed in the RL literature. The learning curves of
the various strategies are plotted in Fig. 6.

Least Taken Action (LTA): Under LTA, an agent will
always choose the action that has been performed least
often in the current state [7], [58], [61]. Like random
action, LTA yields uniform sampling of actions in each
state. Consistent with this, LTA fails to significantly improve on
the learning rates seen under random action (p > 0.001
for all three environments).

Counter-Based Exploration (CB): Whereas LTA actively
samples actions uniformly, CB attempts to induce a
uniform sampling across states. To do this, it maintains
a count of the occurrences of each state, and chooses its
action to minimize the expected count of the resultant
state [71]. As shown in Fig. 6, CB performs even worse
than random action in Dense Worlds and 1-2-3 Worlds
(p < 0.001). It does outperform random action in Mazes
but falls far short of the performance seen by PIG(VI)
(p < 0.001).
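The two heuristics described above can be sketched as simple selection rules. The function names, the data layouts, the use of the internal model to compute the expected count in CB, and the tie-breaking rule (lowest action index) are assumptions of this example.

```python
def lta_action(action_counts, state):
    """Least Taken Action: pick the action tried least often in `state`.

    action_counts[s][a] is the number of times action a was taken in state s.
    """
    counts = action_counts[state]
    return min(range(len(counts)), key=lambda a: counts[a])


def cb_action(state_counts, theta_hat, state):
    """Counter-Based exploration: pick the action whose resulting state has
    the lowest expected visit count under the internal model.

    state_counts[s2]    -- number of visits to state s2 so far
    theta_hat[a][s][s2] -- internal-model probability of s -> s2 under a
    """
    def expected_count(a):
        return sum(p * c for p, c in zip(theta_hat[a][state], state_counts))

    return min(range(len(theta_hat)), key=expected_count)
```

Neither rule looks more than one step ahead, which is one way to see why both may struggle in worlds that call for coordinated action sequences.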

Q-learning on Posterior Expected Information Gain
(PEIG(Q)): Storck et al. [67] developed a utility function
$U_{\mathrm{Storck}}$ to measure past changes in the internal model,
which they used to guide exploration under a Q-learning
algorithm [69]. Let $\tau$ be the most recent time step in
the past over which the internal model for the transition
distribution $\Theta_{a,s,\cdot}$ changed:

$$\tau := \max\{t \mid s(t) = s,\ a(t) = a,\ t < |d|\}$$

Then, considering the internal model before and after
this time step ($\hat{\Theta}^{\tau}$ and $\hat{\Theta}^{\tau+1}$ respectively), and the
data collected up to this point $d^{\tau+1}$, the utility function
defined by Storck et al. is:

$$U_{\mathrm{Storck}}(a, s) := D_{KL}\left[\hat{\Theta}^{\tau+1}_{a,s,\cdot} \,\|\, \hat{\Theta}^{\tau}_{a,s,\cdot}\right] \qquad (17)$$

Fig. 6. Comparison to previous exploration strategies. The average missing information is plotted over time for PIG(VI)
agents along with three exploration strategies from the literature: least taken action (LTA) [7], [58], [61], counter-based
(CB) [71], and Q-learning on posterior expected information gain (PEIG(Q)) [67]. The standard control strategies are
also shown. Standard errors are plotted as dotted lines above and below learning curves. (n=200)



Note that both $\hat{\Theta}^{\tau}$ and $\hat{\Theta}^{\tau+1}$ are internal models previously
(or currently) held by the agent. In the following derivation,
we demonstrate that $U_{\mathrm{Storck}}$ is equivalent to the
posterior expected information gain (PEIG).



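$$\begin{aligned}
\mathrm{PEIG}(a,s) &:= E_{\Theta \mid d^{\tau+1}}\left[ I_M(\Theta \,\|\, \hat{\Theta}^{\tau}) - I_M(\Theta \,\|\, \hat{\Theta}^{\tau+1}) \right] \\
&= E_{\Theta \mid d^{\tau+1}}\left[ \sum_{s'} \Theta_{a,s,s'} \log_2 \frac{\hat{\Theta}^{\tau+1}_{a,s,s'}}{\hat{\Theta}^{\tau}_{a,s,s'}} \right] \\
&= \sum_{s'} \hat{\Theta}^{\tau+1}_{a,s,s'} \log_2 \frac{\hat{\Theta}^{\tau+1}_{a,s,s'}}{\hat{\Theta}^{\tau}_{a,s,s'}} \\
&= D_{KL}\left[ \hat{\Theta}^{\tau+1}_{a,s,\cdot} \,\|\, \hat{\Theta}^{\tau}_{a,s,\cdot} \right] \qquad (18)
\end{aligned}$$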
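For a single state-action pair, the posterior information gain amounts to a Kullback-Leibler divergence between the internal model after and before the latest observation. The sketch below assumes that the internal model is the posterior mean under a symmetric Dirichlet prior with concentration `alpha`. That estimator, the default `alpha=1.0`, and all function names are assumptions made for illustration.

```python
import math


def bayes_estimate(counts, alpha=1.0):
    """Posterior-mean estimate of one transition distribution from counts,
    assuming a symmetric Dirichlet(alpha) prior."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def peig(counts_before, observed_next_state, alpha=1.0):
    """Divergence of the updated estimate from the previous estimate
    for one (a, s) pair, after observing one more transition."""
    counts_after = list(counts_before)
    counts_after[observed_next_state] += 1
    before = bayes_estimate(counts_before, alpha)
    after = bayes_estimate(counts_after, alpha)
    return kl_bits(after, before)
```

Because the quantity is computed only after a transition has been observed, it reports where learning has already occurred rather than where it is expected to occur next.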



Thus, PEIG is a posterior analogue to our PIG utility
function. Q-learning is a model-free approach to
maximizing long-term gains of a utility function [69].
Following Storck et al., we tested the combination of PEIG
and Q-learning (PEIG(Q)) in our test environments.
Surprisingly, PEIG(Q) performs even worse, at least initially,
than random action in all three environments (p < 0.001
for Dense Worlds and 1-2-3 Worlds, p > 0.001 for Mazes). As
such, it fails to yield the learning performance seen by
PIG(VI) in Mazes and 1-2-3 Worlds.
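For reference, a single step of standard tabular Q-learning applied to an arbitrary utility signal can be sketched as follows. The learning rate and discount values are placeholders and are not the settings used in the experiments reported here. The function name and table layout `q[s][a]` are likewise assumptions.

```python
def q_learning_update(q, state, action, utility, next_state,
                      learning_rate=0.1, gamma=0.9):
    """One tabular Q-learning step; q[s][a] is updated in place.

    `utility` is the gain just received for (state, action), for example
    an information-gain value computed after the transition.
    """
    # Bootstrapped target: received utility plus discounted best next value.
    target = utility + gamma * max(q[next_state])
    q[state][action] += learning_rate * (target - q[state][action])
    return q
```

Unlike value iteration over an internal model, this update changes only the entry for the state-action pair just visited, so value estimates spread through the table one experienced transition at a time.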

Altogether, Fig. 6 demonstrates that PIG(VI) outperforms
the previous explorative strategies at learning
structured worlds. To further compare the principles
of PIG(VI) and PEIG(Q), we introduce two cross-over
strategies that borrow from each of them. The first is
PIG(Q), which applies Q-learning to the PIG utility function.
The learning performance of PIG(Q) is similar to
PEIG(Q), falling short of PIG(VI) (Fig. S3). This suggests
that Q-learning is ineffective at coordinating actions
during exploration. The second cross-over strategy is
PEIG(VI), which applies the VI algorithm to Storck et
al.'s utility function. PEIG(VI) matched PIG(VI) in Mazes
(p > 0.001) but not 1-2-3 Worlds (p < 0.001), suggesting
that the posterior information gain is a reasonable predictor
for future information gain under a Dirichlet prior
but not a Discrete prior.

3.6 Comparison to utility functions from Psychology 

Inspired by independent findings in the field of Psychology
that PIG can describe human behavior during
hypothesis testing, we investigated two other measures
also developed in this context [44], [46]. Like PIG, both
are measures of the difference between the current and
hypothetical future internal models:

Predicted mode change (PMC) predicts the height difference
between the modes of the current and future
internal models [6], [44]:



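$$\mathrm{PMC}(a,s) = \sum_{s^*} \hat{\Theta}_{a,s,s^*} \left[ \max_{s'} \hat{\Theta}^{a,s,s^*}_{a,s,s'} - \max_{s'} \hat{\Theta}_{a,s,s'} \right] \qquad (19)$$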

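Predicted L1 change (PLC) predicts the average L1 distance
between the current and future internal models
[35]:

$$\mathrm{PLC}(a,s) = \sum_{s^*} \hat{\Theta}_{a,s,s^*} \left[ \frac{1}{N} \sum_{s'} \left| \hat{\Theta}^{a,s,s^*}_{a,s,s'} - \hat{\Theta}_{a,s,s'} \right| \right] \qquad (20)$$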
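To make the relationship between the three predictive measures concrete, the sketch below computes all of them for one state-action pair from its transition counts. It assumes a posterior-mean internal model under a symmetric Dirichlet prior and computes PIG as the expected Kullback-Leibler divergence of the hypothetical future model from the current one. These modelling choices, the default `alpha=1.0`, and the function names are assumptions of the example.

```python
import math


def bayes_estimate(counts, alpha=1.0):
    """Posterior-mean estimate under an assumed symmetric Dirichlet prior."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def predicted_gains(counts, alpha=1.0):
    """Return (PIG, PMC, PLC) for one (a, s) pair given its counts."""
    current = bayes_estimate(counts, alpha)
    n = len(counts)
    pig = pmc = plc = 0.0
    for s_star, prob in enumerate(current):
        # Hypothetical internal model after observing a transition to s_star.
        updated = list(counts)
        updated[s_star] += 1
        future = bayes_estimate(updated, alpha)
        # Each measure is weighted by the predicted probability of s_star.
        pig += prob * sum(f * math.log2(f / c) for f, c in zip(future, current))
        pmc += prob * (max(future) - max(current))
        plc += prob * sum(abs(f - c) for f, c in zip(future, current)) / n
    return pig, pmc, plc
```

All three quantities share the same outer expectation over the predicted next state and differ only in how the current and hypothetical future models are compared.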



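Note that PMC and PLC differ from PIG in the norm used
to quantify differences between CMC kernels. Considering
an arbitrary norm d, the claim analogous to Theorem
2 would be:

$$E_{s^* \mid a,s,d}\left[ d(\hat{\Theta}^{\mathrm{future}} \,\|\, \hat{\Theta}^{\mathrm{current}}) \right] = E_{s^*,\Theta \mid a,s,d}\left[ d(\Theta \,\|\, \hat{\Theta}^{\mathrm{current}}) - d(\Theta \,\|\, \hat{\Theta}^{\mathrm{future}}) \right]$$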
Fig. 7. Comparison between utility functions. The average missing information is plotted over time for agents that
employ VI to maximize long-term gains in the three objective functions, PIG, PMC, or PLC. The standard control
strategies are also shown. Standard errors are plotted as dotted lines above and below learning curves. (n=200)



This claim states that the expected difference between 
the current and future internal model equals the ex- 
pected change in difference with respect to the ground 
truth. While this claim holds when d is the norm used 
in PIG (Theorem 2), it does not generally hold for either 
of the norms used in PMC or PLC. 

To our knowledge, none of PIG, PMC, or PLC has
previously been applied to sequences of observations or
to embodied action-perception loops. We tested agents
that attempt to maximize PMC or PLC using VI. As Fig.
7 reveals, PIG(VI) again proved to be the best performer
overall. In particular, PIG(VI) significantly outperforms
PMC(VI) in all three environments, and PLC(VI) in
1-2-3 Worlds (p < 0.001). Nevertheless, PMC and PLC
achieved significant improvements over the baseline
control in Mazes and 1-2-3 Worlds, highlighting the
benefit of value iteration across different utility functions.
Interestingly, when performance was measured by an
L1 distance instead of missing information, PIG(VI) still
outperformed PMC(VI) and PLC(VI) in 1-2-3 Worlds
(Fig. S4).

3.7 Generalized utility of exploration 

From a behavioral perspective, learning represents a
fundamental and primary drive [2], [37]. The evolutionary
advantage of such an exploratory drive likely rests on
the general utility of the acquired internal model [33],
[50], [51], [55], [56]. To test this, we assessed the ability of
the agents to use their internal models, derived through
exploration, to accomplish an array of goal-directed 
tasks. We consider two groups of tasks: navigation and 
reward acquisition. 

Navigation: Starting at any given state, the agent has 
to navigate to any given target state with the minimal 
number of steps. 

Reward Acquisition: For every starting state, the agent
has to gather as much reward as possible over 100 time
steps. Reward values are drawn from a standard normal
distribution and randomly assigned to every state in the
CMC. Each agent is tested in ten randomly generated
reward structures.

At several time points during exploration, the agent 
is stopped and its internal models assessed for general 
utility. For each task, we next derive the behavioral 
policy that optimizes performance under the internal 
model. The derived policy is then tested in the world 
(i.e. under the true CMC kernel), and the expected path 
length or acquired reward for that policy is determined. 
As a positive control, we also derive an objective optimal 
policy that maximizes the realized performance for the 
true CMC kernel. The difference in realized performance 
between the subjective and objective policies is used 
as a measure of navigational loss or reward loss. High
navigational loss means the agent's policy took many
more time steps to reach the target state than the optimal
policy. High reward loss means the agent's policy yielded
significantly fewer rewards than the optimal policy.
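The reward-loss evaluation described above can be sketched as follows: a policy is planned under the internal model, executed in expectation under the true kernel, and compared with a policy planned under the true kernel itself. This is one possible reading, not the authors' code. The function names, the time-indexed policy, the convention that reward is collected on arrival in a state, and the layout `theta[a][s][s2]` are all assumptions.

```python
def plan(theta, rewards, horizon):
    """Finite-horizon dynamic programming under the kernel `theta`.

    Returns policy[t][s], the action maximising expected summed reward
    over the remaining steps (reward assumed to be collected on arrival).
    """
    n_actions, n_states = len(theta), len(rewards)
    value = [0.0] * n_states
    policy = []
    for _ in range(horizon):
        q = [[sum(theta[a][s][s2] * (rewards[s2] + value[s2])
                  for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        policy.append([max(range(n_actions), key=lambda a: q[s][a])
                       for s in range(n_states)])
        value = [max(q[s]) for s in range(n_states)]
    policy.reverse()  # policy[0] now applies to the first time step
    return policy


def expected_return(theta_true, rewards, policy, start):
    """Expected summed reward of a time-indexed policy in the true world."""
    dist = [0.0] * len(rewards)
    dist[start] = 1.0
    total = 0.0
    for step_policy in policy:
        nxt = [0.0] * len(rewards)
        for s, p in enumerate(dist):
            if p > 0:
                for s2, q in enumerate(theta_true[step_policy[s]][s]):
                    nxt[s2] += p * q
        dist = nxt
        total += sum(p * r for p, r in zip(dist, rewards))
    return total


def reward_loss(theta_true, theta_hat, rewards, start, horizon=100):
    """Shortfall of the policy planned under the internal model, relative
    to the policy planned under the true kernel."""
    subjective = plan(theta_hat, rewards, horizon)
    objective = plan(theta_true, rewards, horizon)
    return (expected_return(theta_true, rewards, objective, start)
            - expected_return(theta_true, rewards, subjective, start))
```

By construction the loss is zero when the internal model equals the true kernel, and it grows as model errors lead the planner to choose actions that perform poorly in the actual world.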

Fig. 8 depicts the average rank in navigational and
reward loss for the different explorative strategies.
Significance bounds (p = 0.001) around PIG(VI) were
determined by post-hoc analysis of Friedman's test [25]. In
all environments, for both navigation and reward
acquisition, PIG(VI) always grouped with the top performers
(p > 0.001), excepting positive controls. PIG(VI) was the
only strategy to do so. Thus, the explorative strategy
that optimized learning under the missing information
objective function gave the agent an advantage in a
range of independent tasks.

4 Discussion 

In this manuscript we introduced a parsimonious math- 
ematical framework for studying learning-driven explo- 
ration by embodied agents based on information the- 
ory, Bayesian inference and controllable Markov chains 
(CMCs). We compared agents that utilized different 
exploration strategies towards optimizing learning. To 
understand how learning performance depends on the
structure of the world, three classes of environments
were considered that challenge the learning agent in
different ways. We found that fast learning could be
achieved by an exploration strategy that coordinated
actions towards long-term maximization of predicted
information gain (PIG).

Fig. 8. Demonstration of generalized utility. For each world (n=200), explorative strategies are ranked for
average navigational loss (averaged across N start states and N target states) and average reward loss
(averaged across N start states and 10 randomly generated reward distributions). The average ranks are plotted
with standard deviations. Strategies lying outside the pair of solid green lines differ significantly from PIG(VI) in
navigational loss. Strategies lying outside the pair of solid blue lines differ significantly from PIG(VI) in reward loss
(p < 0.0001). The different utility functions and heuristics are distinguished by color: PIG (green), PEIG (magenta),
PMC (blue), PLC (cyan), LTA (orange), CB (yellow). The different coordination methods are distinguished by symbol:
Greedy (squares), VI (circles), VI+ (diamonds), Heuristic Strategies (asterisks). The two standard controls are depicted
as follows: Unembodied (black), Random (red).

4.1 Potential limitations to our approach 

The optimality of the Bayesian Estimate (Theorem 1) 
and the accuracy of PIG (Theorem 2) both require a 
prior distribution on the transition distributions. For 
biological agents, such priors could have been learned 
from earlier exploration of related environments, or may 
represent hardwired beliefs optimized by evolutionary 
pressures. As another possibility, an agent could attempt 
to simultaneously learn a prior while exploring its envi- 
ronment. Indeed, additional results (Fig. S5) show that 
the maximum-likelihood estimation of the concentration 
parameter for Dense Worlds and Mazes enables explo- 
ration that quickly matches the performance of agents 
given accurate priors. Nevertheless, biological agents 
may not always have access to an accurate prior for an 
environment. For such cases, future work is required to 
understand exploration under false priors and how they 
could yield sub-optimal but perhaps biologically realistic 
exploratory behaviors. 

As another potential limitation of our approach, the
VI algorithm is only optimal for dynamic processes with
known stationary transition probabilities and stationary
utilities [8]. In contrast, any utility function, including
PIG, that attempts to capture the progress in learning of
an agent will necessarily change over time. This caveat
may be partially alleviated by the fact that PIG changes
only for the sampled distributions. Furthermore, PIG
decreases in a monotonic fashion (see Fig. 2), which could
potentially be captured by the discount factor of VI.
Interesting future work may lie in accounting for the
effect of such monotonic decreases in estimates of future
learning gains.

In addition, the learning agent does not have access to 
the true transition distributions for performing VI and 
has to rely instead on its evolving internal model. The 
impairment caused by this reliance on the internal model 
was directly assessed with a positive control, PIG(VI+). A
comparison of PIG(VI) against this control (Fig. 4) shows
performance impairment only in Mazes, and it is rather
small compared to the improvements offered by VI.

Finally, it might be argued that the use of missing 
information as a measure of learning unfairly advan- 
taged the PIG utility function. Interestingly, however, 
PIG under VI was not only the fastest learner, but also 
demonstrated the greatest capacity for accomplishing 
goal-directed tasks. Furthermore, it even outperformed
other strategies, including PLC(VI), under an L1
objective function (Fig. S4).

4.2 Related work in Reinforcement Learning 

CMCs are closely related to the more commonly studied
Markov Decision Processes (MDPs) used in Reinforcement
Learning. MDPs differ from CMCs in that they
explicitly include a stationary reward function associated
with each transition [19], [69]. RL research on exploration
usually focuses on its role in balancing exploitative
behaviors during reward maximization. Several methods
for inducing exploratory behavior in RL agents have
been developed. Heuristic strategies such as random
action, least taken action, and counter-based algorithms
are commonly employed in the RL literature. While such
strategies may be useful in RL, our results show that they
are inefficient for learning the dynamics of structured
worlds.

In contrast to these heuristic strategies for exploration,
several principled approaches have been proposed for
inducing exploratory actions to maximize rewards. For
example, the BEETLE algorithm models reward as a
partially observable MDP and derives an analytic
solution to optimize rewards [53]. Similarly, the BOSS
approach maintains a posterior distribution over MDPs
from which it periodically samples for selecting actions
that maximize reward gains "optimistically" from the
samples [3]. These strategies focus exclusively on
extrinsically motivated exploration and do not address
exploration driven by learning for its own sake.

Finally, several studies have investigated intrinsically
motivated learning under the RL framework. For
example, Singh et al. [63] have demonstrated that RL
guided by saliency, an intrinsic motivation derived from
changes in stimulus intensity, can promote the learning
of reusable skills. As mentioned previously, Storck et al.
introduced the combination of Q-learning and PEIG as
an intrinsic motivator of learning [67]. In their study,
PEIG(Q) outperformed random action only over long
time scales. At shorter time scales, random action
performed better. Interestingly, we found exactly the same
trend, initially slow learning with eventual catching up,
when we applied PEIG(Q) to exploration in our test
environments (Fig. 6).

4.3 Related work in Psychology 

In the Psychology literature, PIG, as well as PMC and
PLC, were directly introduced as measures of the
expected difference between a current and future belief [6],
[35], [44], [46]. Here, in contrast, we derived PIG, using
Bayesian inference, from the expected change in missing
information with respect to a ground truth (Theorem 2).
Analogous theorems do not hold for PMC or PLC. For
example, the expected change in L1 distance between an
internal model and the true structure is not equivalent
to the expected L1 distance between successive internal
models. This might explain why PIG(VI) outperformed
PLC(VI) even under an L1 measure of learning (Fig. S4).

We applied the PIG principle to the learning of a full 
model of the world. The Psychology literature, in con- 
trast, focusses on specific questions (hypothesis testing) 
regarding the data. In addition, this prior literature has 
not considered sequences of actions or embodied sensor- 
motor loops. 

It has been shown that human behavior during
hypothesis testing can be explained by a model that
maximizes PIG [44], [46]. This suggests that the PIG
information-theoretic measure may have biological
significance. The behavioral studies, however, could not
distinguish between the different utility functions (PIG,
PMC and PLC) in their ability to explain human
behavior [44]. Perhaps our finding that 1-2-3 Worlds give rise
to large differences between the three utility functions
can help identify new behavioral tasks for disambiguating
the role of these measures in human behavior.

Itti and Baldi recently developed an information-theoretic
measure closely related to PEIG for modeling
bottom-up visual saliency and predicting gaze attention
[5], [29], [30]. In their model, a Bayesian learner maintains
a probabilistic belief structure over the low-level features
of a video. Attention is believed to be attracted to
locations in the visual scene that exhibit high Surprise. Like
PEIG, Surprise quantifies changes in posterior beliefs by
a summed Kullback-Leibler divergence. Several potential
extensions of this work are suggested by our results.
First, it may be useful to model the active nature of
data acquisition during visual scene analysis. In Itti and
Baldi's model, all features are updated for all locations
of the visual scene regardless of current gaze location or
gaze trajectory. Differences in acuity between the fovea
and periphery, however, suggest that gaze location will
have a significant effect on which low-level features can
be transmitted by the retina [74]. Second, our comparison
between the PIG and PEIG utility functions (Figs. 6 and
S3) suggests that predicting where future change might
occur may be more efficient than focusing attention only
on those locations where change has occurred in the past.
A model that anticipates Surprise, as PIG anticipates
information gain, may be better able to explain some
aspects of human attention. For example, if a moving
subject disappears behind an obscuring object, viewers
may anticipate the reemergence of the subject and attend
to the far edge of the obscurer. Incorporating these insights
into new models of visual saliency and attention could
be an interesting course of future research.

4.4 Information-theoretic models of behavior 

The field of behavioral modeling has recently seen
increased utilization of information-theoretic concepts.
These approaches can be grouped under three guiding
principles. The first group uses information theory to
quantify the complexity of a behavioral policy, with high
complexity generally considered undesirable. Tishby and
Polani, for example, considered RL maximization of
rewards under such complexity constraints [73]. While
we did not consider complexity constraints on our
behavioral strategies in the current work, it may be an
interesting topic for future studies.

The second common principle seeks to maximize
predictive information [4], [66], [72] (not to be confused with
predicted information gain, PIG). Predictive information,
which has also been termed excess entropy [14],
estimates the amount of information a known variable
(or past variable) contains regarding an unknown (or
future) variable. For example, in simulated robots, Ay
et al. demonstrated that complex and interesting behaviors
can emerge by choosing control parameters that
maximize the predictive information between successive
sensory inputs [4]. The information bottleneck approach
introduced by Tishby et al. [72] combines predictive
information and complexity constraints, maximizing the
information between a compressed internal variable and
future state progression subject to a constraint on the
complexity of generating the internal variable from
sensory inputs. Recently, Still extended the information
bottleneck method to incorporate actions [66].

Both Ay et al. and Still describe the behaviors that 
result from their models as exploratory. Their objective 
of high predictive information selects actions such that 
the resulting sensory input changes often but in a pre- 
dictable way. We therefore call this form of exploration 
stimulation-driven. Predictive information can only be 
high when the sensory feedback can be predicted, and 
thus stimulation-driven exploration relies on an accurate 
internal model. In contrast, the learning objective we 
introduced here drives actions most strongly when the 
internal model can be improved and this drive weakens 
as it becomes more accurate. Thus, learning-driven and 
stimulation-driven exploration contrast each other while 
being very interdependent. Indeed, a simple additive 
combination of the two objectives may naturally lead to 
a smooth transitioning between the two types of explo- 
ration, directed by the expected accuracy of the internal 
model. In the next section we suggest a correspondence 
of these two computational principles of exploration 
with two distinct modes of behavior distinguished in 
psychology and behavioral research. 

Finally, the Free-Energy (FE) hypothesis introduced by
Friston proposes that the minimization of free-energy, an
information-theoretic bound on surprise, offers a unified
variational principle for governing both the learning
of an internal model as well as actions [17]. Friston
notes that under this principle agents should act to 
minimize the number of states they visit. This stands in 
stark contrast to both learning-driven and stimulation- 
driven exploration. During learning-driven exploration, 
an agent will seek out novel states where missing infor- 
mation is high. During stimulation-driven exploration, 
an agent will actively seek to maintain high variation 
in its sensory inputs. Nevertheless, as Friston argues, 
reduced state entropy may be valuable in dangerous 
environments where few states permit survival. The 
balance between cautionary and exploratory behaviors 
would be an interesting topic for future research. 

4.5 Towards a general theory of exploration 

With the work of Berlyne [11], Psychologists began to
dissect the complex domains of behavior and motivation
that comprise exploration. A distinction between
play (or diversive exploration) and investigation (or
specific exploration) grew out of two competing theories
of exploration. As reviewed by Hutt [27], "curiosity"-theory
proposed that exploration is a consummatory
response to curiosity-inducing stimuli [9], [42]. In
contrast, "boredom"-theory held that exploration was an
instrumental response for stimulus change [20], [43]. To
ameliorate this opposition, Hutt suggested that the two
theories may be capturing distinct behavioral modes,
with "curiosity"-theory underlying investigatory
exploration and "boredom"-theory underlying play. In
children, exploration often occurs in two stages, inspection
to understand what is perceived, followed by play to
maintain changing stimulation [28]. These distinctions
nicely correspond to the differences between our
approach and the predictive information approach of Ay
et al. [4] and Still [66]. In particular, we hypothesize
that our approach, which emphasizes the acquisition of
information, corresponds to curiosity-driven investigation.
In contrast, we propose that predictive information
à la Ay et al. and Still, which rehearses the internal
model in a wide range, may correspond with play.
Further, the proposed method of additively combining
these two principles (Section 4.4) may naturally capture
the transition between investigation and play seen in
children during exploration.

Even in the domain of curiosity-driven exploration,
there are many varied theories [37]. Early theories
viewed curiosity as a drive to maintain a specific level
of arousal. These were followed by theories interpreting
curiosity as a response to intermediate levels of
incongruence between expectations and perceptions, and
later by theories interpreting curiosity as a motivation
to master one's environment. Loewenstein developed an
Information Gap Theory and suggested that curiosity is
an aversive reaction to missing information [37]. More
recently, Silvia proposed that curiosity is composed of
two appraisal components, complexity and
comprehensibility. For Silvia, complexity is broadly defined, and
includes novelty, ambiguity, obscurity, mystery, etc.
Comprehensibility appraises whether something can be
understood. It is interesting how well these two appraisals
match information-theoretic concepts, complexity being
captured by entropy, and comprehensibility by
information gain [49]. Indeed, predicted information gain
might be able to explain the dual appraisals of
curiosity-driven exploration proposed by Silvia. PIG is bounded
by entropy and thus high values require high complexity.
At the same time, PIG equals the expected decrease
in missing information and thus may be equivalent to
expected comprehensibility.

All told, our results add to a bigger picture of explo- 
ration in which the theories for its different aspects fit 
together like pieces of a puzzle. This invites future work 
for integrating these pieces into a more comprehensive 
theory of exploration and ultimately of autonomous 
behavior. 
Fig. S1. Example Dense World. Dense Worlds consist of 4 actions
(separately depicted) and 10 states (depicted as nodes of the graphs). The
transition probabilities associated with taking a particular action are depicted
as arrows pointing from the current state to each of the possible resultant
states. Arrow color depicts the likelihood of each transition.







Fig. S2. Example 1-2-3 World. 1-2-3 Worlds consist of 3 actions 
(separately depicted) and 20 states (depicted as nodes of the 
graphs). The transition probabilities associated with taking a particu- 
lar action are depicted as arrows pointing from the current state to 
each of the possible resultant states. Arrow color depicts the likeli- 
hood of each transition. The absorbing state is depicted in gray. 
Fig. S3. Comparison between different features of current and previous exploration strategies. The average missing
information is plotted over time for agents that apply either VI (circles) or Q-learning (triangles) towards maximization of either
PIG (green) or PEIG (magenta). Standard control strategies are also shown. Standard errors are plotted as dotted lines
above and below learning curves. (n=200)




Fig. S4. Comparison between utility functions under L1 objective. The average L1 distance is plotted over time for agents
that coordinate actions using VI to maximize long-term gains in PIG, PMC, or PLC. Standard control strategies are also
shown. Standard errors are plotted as dotted lines above and below learning curves. (n=200)
Fig. S5. Inferring the concentration parameter during learning. Maximum-likelihood estimation
of the concentration parameter, α, is performed from data collected during exploration. A maximum
concentration of α=20 is imposed. (a,b) The mean error in the inferred concentration parameter
over time is plotted for Dense Worlds and Mazes. (c,d) The missing information for a PIG(VI)
explorer updating its internal model using the true (green with circles) or inferred (purple with
stars) α is plotted over time for Dense Worlds and Mazes. Standard control explorers (with α given)
have been included. Dotted lines above and below learning curves depict standard errors. Notice that,
even when required to infer the appropriate concentration parameter, the explorer is still able to
quickly learn an accurate internal model.



